url <- "https://www.dropbox.com/scl/fi/w6cmx5fgl6y5e1sizeskf/t08_data_b1700_01.csv?rlkey=trab5xa3hpqhf01ruhx1w2ers&dl=1"
data <- read.csv(url)
rm(url)12 Data Manipulation (2.4)

12.1 Learning Outcomes
By the end of this tutorial, you should:
be confident in cleaning datasets, preparing them for analysis
understand how to manipulate variables, and create new variables, in preparation for analysis
12.2 Introduction
A lot of the data you will deal with in sport is ‘messy’. That is, it may not land on your screen in a format that is immediately ready for analysis.
Therefore, some of the key steps we need to take include:
making sure we can read and write the data files
making sure the variables are named consistently and accurately
dealing with missing data
dealing with outliers
making sure each variable type is defined correctly
we have all the variables that we need to conduct our analysis
The following sections describe how a number of these steps can be achieved in R. From my experience, this stage is likely to be the most time-consuming (and frustrating) part of any data analysis, so it’s worth making sure you fully understand the commands and processes outlined below.
12.3 The ‘tidyverse’ package
‘tidyverse’ is a collection of R packages designed for data manipulation, exploration, and visualization.
Two of the core packages within tidyverse, ‘readr’ and ‘dplyr’, provide useful functions for reading and writing data in various formats.
We encountered one part of the tidyverse package, ‘tibbles’, Section 9.8. However, there are a lot of other functions within this package that make it well worth learning.
12.4 Reading and Writing Data using ‘tidyverse’
12.4.1 Reading Data - ‘readr’
‘readr’ is a package within Tidyverse that provides functions to read data from common file formats, including CSV, TSV, and fixed-width files. It’s designed for fast and efficient data reading with user-friendly parsing and type conversion.
To read data from a CSV file, you can use the ‘read_csv()’ function as follows:
# Install and load tidyverse
install.packages("tidyverse") # only needed if you've not already installed tidyverse
library(code)
# Import a .csv file
data <- read_csv("data.csv")For other file formats, we can use the corresponding ‘tidyverse’ read functions, such as ‘read_tsv()’ for TSV files or ‘read_fwf()’ for fixed-width files.
12.4.2 Writing Data - ‘readr’
‘readr’ also provides functions for writing data to common file formats, like CSV and TSV. For example, the ‘write_csv()’ function can be used to write a data frame to a CSV file:
# Write data to a CSV file
write_csv(data, "output.csv")Similarly, use the ‘write_tsv()’ function to write data to a TSV file.
12.5 Data cleaning and transformation
‘dplyr’ is another core package within tidyverse that provides a set of tools for data manipulation, such as filtering, selecting, and summarising data.
The following procedures are some of the most common ones you’ll want to use when preparing your data prior to analysis.
12.5.1 Filtering rows with a specific value, and selecting specific columns
# Load the required packages
library(tidyverse)
# create a sample tibble
data <- tibble(
id = 1:5,
category = c("A", "B", "A", "B", "A"),
value = c(23, 45, 12, 78, 37)
)
# Filter rows with category 'A' and select columns 'id' and 'value'
filtered_data <- data %>%
filter(category == "A") %>%
select(id, value)
print(filtered_data)12.5.2 Selecting specific columns
# Select columns by name
selected_data <- data %>%
select(column1, column2, column3)12.5.3 Sorting your data by column values
# Sort data in ascending order
sorted_data <- data %>%
arrange(column_name)
# Sort data in descending order
sorted_data <- data %>%
arrange(desc(column_name))12.5.4 Creating new columns
# Add a new column with calculated values
new_data <- data %>%
mutate(new_column = column1 \* column2)12.5.5 Grouping data, and performing aggregations
# Group data by a column and calculate the mean of another column
grouped_data <- data %>%
group_by(column1) %>%
summarize(mean_value = mean(column2))12.5.6 Combining reading, writing, and manipulating Data
You can chain together reading, manipulating, and writing data using the ‘%>%’ pipe operator from ‘tidyverse’.
This allows for a more readable and efficient workflow.
In the following example, we read data from a CSV file, filter it, and then write it back to a new CSV file.
# Read data from a CSV file, filter, and write to a new CSV file
read_csv("data.csv") %>%
filter(category == "A") %>%
select(id, value) %>%
write_csv("filtered_data.csv")12.6 Data reshaping with ‘tidyr’
‘tidyr’ is another core tidyverse package that provides functions for cleaning and reshaping data. It helps create “tidy” data, where each variable is a column, and each observation is a row.
12.7 Gather multiple columns into key-value pairs (wide to long format)
# Gather columns 'column1', 'column2', and 'column3' into key-value pairs
long_data <- data %>%
gather(key = "variable", value = "value", column1, column2, column3)12.8 Spread key-value pairs into separate columns (long to wide format)
# Spread key-value pairs in 'variable' and 'value' columns into separate columns
wide_data <- data %>%
spread(key = variable, value = value)12.9 Separate a single column into multiple columns
# Separate 'column_name' into two new columns 'column1'
and 'column2', splitting by a delimiter (e.g., '-')
separated_data <- data %>%
separate(column_name, into = c("column1", "column2"), sep = "-")12.9.1 Unite multiple columns into a single column
# Unite columns into a new column 'column_name', with delimiter (e.g., '-')
united_data <- data %>%
unite(column_name, column1, column2, sep = "-")12.10 Alternative Approaches
In the preceding sections, we used the ‘tidyverse’ package to conduct various operations on a newly-imported dataset.
Before ‘tidyverse’ was introduced, it was also possible to conduct the same operations, using the following code.
You may still find these approaches useful.
12.10.1 Reading a .csv file
# load library
library(dplyr)
# import data
data <- read.csv('/Users/directory/filename.csv')
head(data) # display the first six rows
str(data) # inspect variable types12.10.2 Remove outliers (‘999’)}
# Check each row if 999 is present and, if so, replace with 'NA'
data[data == 999, ]12.10.3 Removing rows with missing data
data01 <- na.omit(data) # removes any row (observation) with missing data12.10.4 Remove a variable ‘X’ from dataset ‘data’
data<- subset(data, select = -c(X))12.10.5 Rename variables in dataset ‘data’ and create a new dataset ‘data02’
data02 <-rename(data, var_01 = A, var_02 = B) # create the new dataset
rm(data) # remove original dataset from memory12.10.6 Change variable types
data02$game_id = as.factor(data02\$game_id) # make game_id a factor
data02$opposition_id = as.factor(data02\$opposition_id) # make opposition_id a factor12.10.7 Create a logical true/false variable
In our current dataset, a home game is indicated as either ‘Yes’ or ‘No’. We want to transform this variable into a logical type, which in R means using TRUE or FALSE.
# first, we replace the existing values with TRUE or FALSE
data02$home_game[data02$home_game == 'Yes'] <- TRUE
data02$home_game[data02$home_game == 'No'] <- FALSE
# then, we change the variable type to logical
data02$home_game = as.logical(data02$home_game)
# finally, we check out dataset and make sure the variable type has changed
head(data02) # show first six rows
str(data02) # inspect variable types12.10.8 Calculate a new variable based on two existing variables
Often, you will want to create new variables based on the existing data within your dataset. This can be done as follows (assuming both variables are of the same type):
data02$var_03 <- data02$var_02-data02$var_0112.11 Activity: Importing, cleaning, and transforming a dataset
Note: all of the following steps should be conducted in a single R Script.
Working on the dataframe ‘data’, and with the dplyr library installed and loaded:
- Use the ‘head’, ‘str’ and ‘summary’ commands to overview the dataset.
- In this dataset, ‘999’ represents an outlier. Replace these values with ‘NA’.
- Remove any observations with missing data (‘NA’).
- Remove the variable ‘X’ from the dataset.
- Rename each of the variables to the format ‘var_id’, ‘var_a’ etc.
- Change the variable types as follows: var_id = factor, var_a = factor.
- Use the ‘str’ function to check variable types.
- Calculate a new variable, ‘var_e’, which is the sum of var_c + var_d.
- Write the dataset as a .csv file to an appropriate location where you might be able to access it at a later date.
Check your answers:
Show the answer
# Overview the dataset
head(data) id a b c d X
1 Winnipeg Jets David TRUE 46 16 27.73578
2 Minnesota Wild William FALSE 35 63 62.23405
3 New York Islanders Michael TRUE 13 2 34.34055
4 Los Angeles Kings Michael FALSE 80 44 999.00000
5 Boston Bruins James TRUE 28 16 55.25340
6 Dallas Stars Robert TRUE 97 20 24.44302
Show the answer
str(data)'data.frame': 1000 obs. of 6 variables:
$ id: chr "Winnipeg Jets" "Minnesota Wild" "New York Islanders" "Los Angeles Kings" ...
$ a : chr "David" "William" "Michael" "Michael" ...
$ b : logi TRUE FALSE TRUE FALSE TRUE TRUE ...
$ c : int 46 35 13 80 28 97 999 8 51 68 ...
$ d : int 16 63 2 44 16 20 65 17 9 38 ...
$ X : num 27.7 62.2 34.3 999 55.3 ...
Show the answer
summary(data) id a b c
Length:1000 Length:1000 Mode :logical Min. : 1.00
Class :character Class :character FALSE:482 1st Qu.: 28.00
Mode :character Mode :character TRUE :518 Median : 50.00
Mean : 60.25
3rd Qu.: 76.00
Max. :999.00
NA's :9
d X
Min. : 1.00 Min. : 3.064
1st Qu.: 24.00 1st Qu.: 40.913
Median : 49.00 Median : 49.647
Mean : 57.71 Mean : 69.190
3rd Qu.: 75.00 3rd Qu.: 60.380
Max. :999.00 Max. :999.000
NA's :8 NA's :10
Show the answer
# Replace 999 with NA in the dataset
data[data == 999] <- NA
# Remove observations with missing data
data_clean <- na.omit(data)
# Remove variable 'X'
data_clean$X <- NULL
# Rename variables
names(data_clean) <- c("var_id", "var_a", "var_b", "var_c", "var_d")
# Change variable types
data_clean$var_id <- as.factor(data_clean$var_id)
data_clean$var_a <- as.factor(data_clean$var_a)
# Use the 'str' function to check variable types
str(data_clean)'data.frame': 823 obs. of 5 variables:
$ var_id: Factor w/ 31 levels "Anaheim Ducks",..: 31 15 19 3 10 22 11 5 20 14 ...
$ var_a : Factor w/ 7 levels "David","James",..: 1 7 5 2 6 6 7 6 6 5 ...
$ var_b : logi TRUE FALSE TRUE TRUE TRUE FALSE ...
$ var_c : int 46 35 13 28 97 8 51 68 43 56 ...
$ var_d : int 16 63 2 16 20 17 9 38 32 89 ...
- attr(*, "na.action")= 'omit' Named int [1:177] 4 7 19 29 34 41 45 49 55 56 ...
..- attr(*, "names")= chr [1:177] "4" "7" "19" "29" ...
Show the answer
# Calculate 'var_e' as the sum of 'var_c' + 'var_d'
data_clean$var_e <- data_clean$var_c + data_clean$var_d